Abstract. We propose a novel classification technique whose aim is to
select an appropriate representation for each datapoint, in contrast to
the usual approach of selecting a representation encompassing the whole
dataset. This datum-wise representation is found by using a sparsity
inducing empirical risk, which is a relaxation of the standard $L_0$
regularized risk. The classification problem is modeled as a sequential
decision process that sequentially chooses, for each datapoint, which
features to use before classifying. Datum-Wise Classification extends
naturally to multi-class tasks, and we describe a specific case where our
inference has equivalent complexity to a traditional linear classifier,
while still using a variable number of features. We compare our classifier
to classical $L_1$ regularized linear models ($L_1$-SVM and LARS) on a set
of common binary and multi-class datasets and show that for an equal
average number of features used we can get improved performance using our
method.

1 Introduction 

Feature Selection is one of the main contemporary problems in Machine Learning
and has been approached from many directions. One modern approach to feature
selection in linear models consists in minimizing an $L_0$ regularized empirical
risk. This particular risk encourages the model to have a good balance between a
low classification error and high sparsity (where only a few features are used
for classification). As the regularized problem is combinatorial, many
approaches such as the LASSO [1] try to address the combinatorial problem by
using more practical norms such as $L_1$. These approaches have been developed
with two main goals in mind: restricting the number of features for improving
classification speed, and limiting the used features to the most useful to
prevent overfitting. These classical approaches to sparsity aim at finding a
sparse representation of the feature space that is global to the entire dataset.

* This work was partially supported by the French National Agency of Research (Lam- 
pada ANR-09-EMER-007). 
We propose a new approach to sparsity where the goal is to limit the number
of features per datapoint, hence the name datum-wise sparse classification
(DWSC). This means that our approach allows the choice of features used for
classification to vary relative to each datapoint; datapoints that are easy to
classify can be inferred on without looking at very many features, and more
difficult datapoints can be classified using more features. The underlying
motivation is that, while classical approaches balance between accuracy and
sparsity at the dataset level, our approach optimizes this balance at the
individual datum level, thus resulting in equivalent accuracy at higher overall
sparsity. This kind of sparsity is interesting for several reasons: First,
simpler explanations are always to be preferred as per Occam's Razor. Second,
in the knowledge extraction process, such datum-wise sparsity is able to
provide unique information about the underlying structure of the data space.
Typically, if a dataset is organized onto two different subspaces, the
datum-wise sparsity principle allows the model to automatically choose to
classify using only the features of one or the other of the subspaces.

DWSC considers feature selection and classification as a single sequential 
decision process. The classifier iteratively chooses which features to use for clas- 
sifying each particular datum. In this sequential decision process, datum-wise 
sparsity is obtained by introducing a penalizing reward when the agent chooses 
to incorporate an additional feature into the decision process. The model is 
learned using an algorithm inspired by Reinforcement Learning [2]. 

The contributions of the paper are threefold: (i.) We propose a new approach 
where classification is seen as a sequential process where one has to choose which 
features to use depending on the input being inferred upon, (ii.) This new ap- 
proach results in a model that obtains good performance in terms of classification 
while maximizing datum-wise sparsity, i.e. the mean number of features used for 
classifying the whole dataset. It also naturally handles multi-class
classification problems, solving them by using as few features as possible for
all classes combined, (iii.) We perform a series of experiments on 14 different
corpora and compare the model with those obtained by LARS [3] and an
$L_1$-regularized SVM, thus providing a qualitative study of the behaviour of
our algorithm.

The paper is organized as follows: First, we define the notion of datum-wise
sparse classifiers and explain the interest of such models in Section 2. We
then describe our sequential approach to classification and detail the learning
algorithm and the complexity of such an algorithm in Section 3. We describe
how this approach can be extended to multi-class classification in Section 4. We
detail experiments on 14 datasets, and also give a qualitative analysis of the
behaviour of this model in Section 6. The related work is given in Section 7.

2 Datum-Wise Sparse Classifiers

We consider the problem of supervised multi-class classification¹ where one
wants to learn a classification function $f_\theta : \mathcal{X} \to \mathcal{Y}$
to associate one category $y \in \mathcal{Y}$ to a vector $\mathbf{x} \in \mathcal{X}$,
where $\mathcal{X} = \mathbb{R}^n$, $n$ being the dimension of the input vectors.
$\theta$ is the set of parameters learned from a training set composed of
input/output pairs $\mathrm{Train} = \{(\mathbf{x}_i, y_i)\}_{i \in [1..N]}$. These
parameters are commonly found by minimizing the empirical risk defined by:

$$\theta^* = \operatorname{argmin}_\theta \frac{1}{N} \sum_{i=1}^{N} \Delta(f_\theta(\mathbf{x}_i), y_i), \qquad (1)$$

where $\Delta$ is the loss associated to a prediction error.

¹ Note that this includes the binary supervised classification problem as a special case.

This empirical risk minimization problem does not consider any prior
assumption or constraint concerning the form of the solution and can result in
overfitting models. Moreover, when facing a very large number of features,
obtained solutions usually need to perform computations on all the features for
classifying any input, thus negatively impacting the model's classification
speed. We propose a different risk minimization problem where we add a
penalization term that encourages the obtained classifier to classify using on
average as few features as possible. In comparison to classical $L_0$ or $L_1$
regularized approaches where the goal is to constrain the number of features
used at the dataset level, our approach enforces sparsity at the datum level,
allowing the classifier to use different features when classifying different
inputs. This results in a datum-wise sparse classifier that, when possible,
only uses a few features for classifying easy inputs, and more features for
classifying difficult or ambiguous ones.

We consider a different type of classifier function that, in addition to
predicting a label y given an input x, also provides information about which
features have been used for classification. Let us denote
$\mathcal{Z} = \{0; 1\}^n$. We define a datum-wise classification function $f$
of parameters $\theta$ as:

$$f_\theta : \mathcal{X} \to \mathcal{Y} \times \mathcal{Z}, \qquad f_\theta(\mathbf{x}) = (y, \mathbf{z}),$$

where y is the predicted output and z is an n-dimensional vector
$\mathbf{z} = (z^1, \ldots, z^n)$, where $z^i = 1$ implies that feature i has
been taken into consideration for computing label y on datum x. By convention,
we denote the predicted label as $y_\theta(\mathbf{x})$ and the corresponding
z-vector as $\mathbf{z}_\theta(\mathbf{x})$. Thus, if $z_\theta^i(\mathbf{x}) = 1$,
feature i has been used for classifying x into category $y_\theta(\mathbf{x})$.

This definition of datum-wise classifiers has two main advantages: First, as we
will see in the next section, because $f_\theta$ can explain its use of features
with $\mathbf{z}_\theta(\mathbf{x})$, we can add constraints on the features used
for classification. This allows us to encourage datum-wise sparsity which we
define below. Second, while this is not the main focus of our article, analysis
of $\mathbf{z}_\theta(\mathbf{x})$ gives a qualitative explanation of how the
classification decision has been made, which we study in Section 6. Note that
the way we define datum-wise classification is an extension to the usual
definition of a classifier.
2.1 Datum-Wise Sparsity

Datum-wise sparsity is obtained by adding a penalization term to the empirical
loss defined in equation (1) that limits the average number of features used for
classifying:

$$\theta^* = \operatorname{argmin}_\theta \frac{1}{N} \sum_{i=1}^{N} \Delta(y_\theta(\mathbf{x}_i), y_i) + \lambda \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{z}_\theta(\mathbf{x}_i)\|_0. \qquad (2)$$

The term $\|\mathbf{z}_\theta(\mathbf{x}_i)\|_0$ is the $L_0$ norm² of
$\mathbf{z}_\theta(\mathbf{x}_i)$, i.e. the number of features selected for
classifying $\mathbf{x}_i$, that is, the number of elements in
$\mathbf{z}_\theta(\mathbf{x}_i)$ equal to 1. In the general case, the
minimization of this new risk results in a classifier that on average selects
only a few features for classifying, but may use a different set of features
depending on the input being classified. We consider this to be the crux of
the DWSC model: the classifier takes each datum into consideration differently
during the inference process.
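To make the penalized risk of equation (2) concrete, the following minimal sketch evaluates it for a handful of predictions. It assumes a 0/1 loss for $\Delta$; the function name `datum_wise_risk` and the toy data are illustrative only and are not taken from the paper.

```python
def datum_wise_risk(y_pred, y_true, z_masks, lam):
    """Evaluate equation (2): mean loss + lam * mean number of used features.

    Assumption: the loss Delta is the 0/1 loss.
    z_masks[i] is the binary vector z_theta(x_i) of features used for x_i.
    """
    n = len(y_true)
    mean_error = sum(1 for p, t in zip(y_pred, y_true) if p != t) / n
    # The L0 "norm" of a binary mask is simply the count of its ones.
    mean_features = sum(sum(z) for z in z_masks) / n
    return mean_error + lam * mean_features


# Toy example: three inputs, classified with 1, 2 and 3 features respectively.
y_true = [0, 1, 1]
y_pred = [0, 1, 0]
z_masks = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]]
risk = datum_wise_risk(y_pred, y_true, z_masks, lam=0.1)
print(risk)
```

Here one of three predictions is wrong and two features are used on average, so the risk is 1/3 + 0.1 * 2. Note that each input contributes its own feature count, which is what distinguishes this penalty from a global sparsity constraint.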




Fig. 1. The sequential process for a problem with 4 features ($f_1, \ldots, f_4$)
and 3 possible categories ($y_1, \ldots, y_3$). Left: The gray circle is the
initial state for one particular input x. Small circles correspond to terminal
states where a classification decision has been made. In this example, the
classification (bold arrows) has been made by sequentially choosing to acquire
feature 3 then feature 2 and then to classify x in category $y_1$. The bold
(red) arrows correspond to the trajectory made by the current policy. Right:
The values of $\mathbf{z}_\theta(\mathbf{x})$ for the different states are
illustrated. The value on the arrows corresponds to the immediate reward
received by the agent assuming that x belongs to category $y_1$. At the end of
the process, the agent has received a total reward of $-2\lambda$.



Note that the optimization of the loss defined in equation (2) is a
combinatorial problem that cannot be easily solved. In the next section of this
paper, we propose an original way to deal with this problem, based on a Markov
Decision Process.

² The $L_0$ 'norm' is not a proper norm, but we will refer to it as the $L_0$
norm in this paper, as is common in the sparsity community.

3 Datum-Wise Sparse Sequential Classification

We consider a Markov Decision Problem (MDP, [4])³ to classify an input
$\mathbf{x} \in \mathbb{R}^n$. At the beginning, we have no information about x,
that is, we have no attribute/feature values. Then, at each step, we can choose
to acquire a particular feature of x, or to classify x. The act of classifying x
in the category y ends an "episode" of the sequential process. The
classification process is a deterministic process defined by:

- A set of states $\mathcal{X} \times \mathcal{Z}$, where state (x, z)
  corresponds to the state where the agent is currently classifying datum x and
  has selected features specified by z. The number of currently selected
  features is thus $\|\mathbf{z}\|_0$.

- A set of actions $\mathcal{A}$, where $\mathcal{A}(\mathbf{x}, \mathbf{z})$
  denotes the set of possible actions in state (x, z). We consider two types of
  actions:

  • $\mathcal{A}_f$ is the set of feature selection actions
    $\mathcal{A}_f = \{f_1, \ldots, f_n\}$ such that, for $a \in \mathcal{A}_f$,
    $a = f_j$ corresponds to choosing feature j. Action $f_j$ corresponds to a
    vector with only the $j$-th element equal to 1, i.e.
    $f_j = (0, \ldots, 1, \ldots, 0)$. Note that the set of possible feature
    selection actions on state (x, z), denoted
    $\mathcal{A}_f(\mathbf{x}, \mathbf{z})$, is equal to the subset of currently
    unselected features, i.e.
    $\mathcal{A}_f(\mathbf{x}, \mathbf{z}) = \{f_j \text{ s.t. } z^j = 0\}$.

  • $\mathcal{A}_y$ is the set of classification actions
    $\mathcal{A}_y = \mathcal{Y}$, which correspond to assigning a label to the
    current datum. Classification actions stop the sequential decision process.

- A transition function defined only for feature selection actions (since
  classification actions are terminal):

  $$T : \mathcal{X} \times \mathcal{Z} \times \mathcal{A}_f \to \mathcal{X} \times \mathcal{Z}, \qquad T((\mathbf{x}, \mathbf{z}), f_j) = (\mathbf{x}, \mathbf{z}'),$$

  where $\mathbf{z}'$ is an updated version of z such that
  $\mathbf{z}' = \mathbf{z} + f_j$.

Policy We define a parameterized policy $\pi_\theta$, which, for each state
(x, z), returns the best action as defined by a scoring function
$s_\theta(\mathbf{x}, \mathbf{z}, a)$:

$$\pi_\theta : \mathcal{X} \times \mathcal{Z} \to \mathcal{A} \quad \text{and} \quad \pi_\theta(\mathbf{x}, \mathbf{z}) = \operatorname{argmax}_{a} s_\theta(\mathbf{x}, \mathbf{z}, a).$$

The policy $\pi_\theta$ decides which action to take by applying the scoring
function to every action possible from state (x, z) and greedily taking the
highest scoring action. The scoring function reflects the overall quality of
taking action a in state (x, z), which corresponds to the total reward obtained
by taking action a in (x, z) and thereafter following policy $\pi_\theta$⁴:

$$s_\theta(\mathbf{x}, \mathbf{z}, a) = \sum_{t=0}^{T} \left( r_\theta^t \mid (\mathbf{x}, \mathbf{z}), a \right).$$

Here $(r_\theta^t \mid (\mathbf{x}, \mathbf{z}), a)$ corresponds to the reward
obtained at step t while having started in state (x, z) and followed the policy
with parameterization $\theta$ for t steps. Taking the sum of these rewards
gives us the total reward from state (x, z) until the end of the episode. Since
the policy is deterministic, we may refer to a parameterized policy using simply
$\theta$. Note that the optimal parameterization $\theta^*$ obtained after
learning (see Sec. 3.3) is the parameterization that maximizes the expected
reward in all state-action pairs of the process.

³ The MDP is deterministic in our case.

In practice, the initial state of such a process for an input x corresponds to 
an empty z vector where no feature has been selected. The policy $\theta$ sequentially
picks, one by one, a set of features pertinent to the classification task, and then 
chooses to classify once enough features have been considered. 
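The greedy episode described above can be sketched in a few lines. This is only an illustration under stated assumptions: `run_policy` and `toy_score` are hypothetical names, and the hand-written `toy_score` stands in for a learned scoring function $s_\theta$.

```python
def run_policy(x, score, labels):
    """Greedy episode: acquire features until a classification action wins."""
    z = [0] * len(x)  # initial state: no feature selected
    while True:
        # Possible actions: every currently unselected feature, plus every label.
        actions = [("feature", j) for j, zj in enumerate(z) if zj == 0]
        actions += [("classify", y) for y in labels]
        kind, value = max(actions, key=lambda a: score(x, z, a))
        if kind == "classify":
            return value, z  # a classification action ends the episode
        z[value] = 1  # deterministic transition z' = z + f_j


def toy_score(x, z, action):
    """Hypothetical hand-written scores; a learned s_theta would replace this."""
    kind, value = action
    if kind == "feature":
        return 1.0 if value == 0 else -1.0  # only feature 0 looks useful
    if z[0] == 0:
        return 0.0  # nothing known yet, so acquiring feature 0 scores higher
    return x[0] if value == 1 else -x[0]  # classify by the sign of feature 0


label, used = run_policy([2.0, 9.0, 9.0], toy_score, labels=[0, 1])
print(label, used)
```

The loop always terminates, because once every feature has been selected only classification actions remain. In this toy run the policy reads a single feature and then classifies, leaving the other two features untouched.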

Reward The reward function reflects the immediate quality of taking action
a in state (x, z) relative to the problem at hand. We define a reward function
over the training set $(\mathbf{x}_i, y_i) \in \mathcal{T}$:
$r : \mathcal{X} \times \mathcal{Z} \times \mathcal{A} \to \mathbb{R}$, which
reflects how good a decision taking action a in state $(\mathbf{x}_i, \mathbf{z})$
for input $\mathbf{x}_i$ is relative to our classification task. This reward is
defined as follows⁵:

- If a corresponds to a feature selection action, then the reward is $-\lambda$.

- If a corresponds to a classification action, i.e. a = y, we have:

  $$r(\mathbf{x}_i, \mathbf{z}, y) = 0 \text{ if } y = y_i, \text{ and } = -1 \text{ if } y \neq y_i.$$

In practice, we set $\lambda \ll 1$ to avoid situations where classifying
incorrectly is a better decision than choosing multiple features.

3.1 Reward Maximization and Loss Minimization

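As explained in Section 2, our ultimate goal is to find the parameterization
$\theta^*$ that minimizes the datum-wise empirical loss defined in equation (2).
The training process for the MDP described above is the maximization of a reward
function. Let us therefore show that maximizing the reward function is
equivalent to minimizing the datum-wise empirical loss:

$$
\begin{aligned}
\theta^* &= \operatorname{argmin}_\theta \frac{1}{N} \sum_{i=1}^{N} \Delta(y_\theta(\mathbf{x}_i), y_i) + \lambda \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{z}_\theta(\mathbf{x}_i)\|_0 \\
&= \operatorname{argmin}_\theta \frac{1}{N} \sum_{i=1}^{N} \left( \Delta(y_\theta(\mathbf{x}_i), y_i) + \lambda \|\mathbf{z}_\theta(\mathbf{x}_i)\|_0 \right) \\
&= \operatorname{argmax}_\theta \frac{1}{N} \sum_{i=1}^{N} \left( -\Delta(y_\theta(\mathbf{x}_i), y_i) - \lambda \|\mathbf{z}_\theta(\mathbf{x}_i)\|_0 \right) \\
&= \operatorname{argmax}_\theta \frac{1}{N} \sum_{i=1}^{N}
\begin{cases}
-\lambda \|\mathbf{z}_\theta(\mathbf{x}_i)\|_0 & \text{if } y_\theta(\mathbf{x}_i) = y_i \\
-1 - \lambda \|\mathbf{z}_\theta(\mathbf{x}_i)\|_0 & \text{if } y_\theta(\mathbf{x}_i) \neq y_i
\end{cases} \\
&= \operatorname{argmax}_\theta \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_\theta(\mathbf{x}_i)+1} r\left(\mathbf{x}_i, \mathbf{z}_\theta^{(t)}(\mathbf{x}_i), \pi_\theta(\mathbf{x}_i, \mathbf{z}_\theta^{(t)})\right)
\end{aligned}
$$

where $\pi_\theta(\mathbf{x}_i, \mathbf{z}_\theta^{(t)})$ is the action taken at
time t by the policy $\pi_\theta$ for the training example $\mathbf{x}_i$. The
fourth line takes $\Delta$ to be the 0/1 loss, which is the loss encoded by the
classification reward. Here $T_\theta(\mathbf{x}_i)$ can be read as the number
of feature selection actions taken before classifying $\mathbf{x}_i$, so the
inner sum covers those actions plus the final classification action.

Such an equivalence between risk minimization and reward maximization
shows that the optimal classifier $\theta^*$ corresponds to the optimal policy
in the MDP defined previously. This equivalence allows us to use classical MDP
resolution algorithms in order to find the best classifier. We detail the
learning procedure in Section 3.3.

⁴ This corresponds to the classical Q-function in Reinforcement Learning.

⁵ Note that we can add $-\lambda \cdot \|\mathbf{z}\|_0$ to the reward at the
end of the episode, and give a constant intermediate reward of 0. These two
approaches are interchangeable.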
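The equivalence can be checked numerically on a single episode. The sketch below assumes the 0/1 loss and the reward scheme of this section ($-\lambda$ per acquired feature, 0 or $-1$ for the final classification); the function names are illustrative and do not come from the paper.

```python
def episode_return(n_selected, y_pred, y_true, lam):
    """Sum of rewards over one episode of the sequential process."""
    rewards = [-lam] * n_selected  # one -lambda per feature selection action
    rewards.append(0.0 if y_pred == y_true else -1.0)  # classification action
    return sum(rewards)


def datum_loss(n_selected, y_pred, y_true, lam):
    """Per-datum term of equation (2), assuming a 0/1 loss."""
    return (0.0 if y_pred == y_true else 1.0) + lam * n_selected


lam = 0.1
# Episode of Fig. 1: two features acquired, then a correct classification.
correct = episode_return(2, y_pred=1, y_true=1, lam=lam)
# Same number of features, but a wrong label.
wrong = episode_return(2, y_pred=0, y_true=1, lam=lam)
print(correct, wrong)
```

In both cases the episode return is the negated per-datum loss, so averaging returns over the training set and maximizing is the same as minimizing equation (2).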



3.2 Inference and Approximated Decision Processes 

Due to the infinite number of possible inputs x, the number of states is also
infinite. Moreover, the reward function $r(\mathbf{x}, \mathbf{z}, a)$ is only
known for the values of x that are in the training set and cannot be computed
for any other input. For these two reasons, it is not possible to compute the
score function for all state-action pairs in a tabular manner, and this function
has to be approximated.

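The scoring function that underlies the policy, $s_\theta(\mathbf{x}, \mathbf{z}, a)$,
is approximated with a linear model⁶:

$$s_\theta(\mathbf{x}, \mathbf{z}, a) = \langle \Phi(\mathbf{x}, \mathbf{z}, a); \theta \rangle$$

and the policy defined by such a function consists in taking in state (x, z) the
action $a'$ that maximizes the scoring function, i.e.
$a' = \operatorname{argmax}_{a \in \mathcal{A}} \langle \Phi(\mathbf{x}, \mathbf{z}, a); \theta \rangle$.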

Because there are infinitely many of them, the state-action pairs are
represented in a feature space. We denote by $\Phi(\mathbf{x}, \mathbf{z}, a)$
the featurized representation of the state-action pair ((x, z), a). Many
definitions may be used for this feature representation, but we propose a simple
projection: we restrict the representation of x to only the selected features.
Let $\mu(\mathbf{x}, \mathbf{z})$ be the restriction of x according to z:

$$
\mu(\mathbf{x}, \mathbf{z})^i =
\begin{cases}
x^i & \text{if } z^i = 1 \\
0 & \text{elsewhere}
\end{cases}
$$

To be able to differentiate between an attribute of x that is not yet known,
and an attribute that is simply equal to 0, we must keep the information present
in z. Let $\phi(\mathbf{x}, \mathbf{z}) = (\mathbf{z}, \mu(\mathbf{x}, \mathbf{z}))$
be the intermediate representation that corresponds to the concatenation of z
with $\mu(\mathbf{x}, \mathbf{z})$. Now we simply need to keep the information
present in a in a manner that allows each action to be easily distinguished
by a linear classifier. To do this we use the block-vector trick [5] which
consists in projecting $\phi(\mathbf{x}, \mathbf{z})$ into a higher dimensional
space such that the position of $\phi(\mathbf{x}, \mathbf{z})$ inside the global
vector $\Phi(\mathbf{x}, \mathbf{z}, a)$ is dependent on action a:

$$\Phi(\mathbf{x}, \mathbf{z}, a) = (0, \ldots, 0, \phi(\mathbf{x}, \mathbf{z}), 0, \ldots, 0).$$

In $\Phi(\mathbf{x}, \mathbf{z}, a)$, the block $\phi(\mathbf{x}, \mathbf{z})$ is
at position $i_a \cdot |\phi(\mathbf{x}, \mathbf{z})|$ where $i_a$ is the index
of action a in the set of all the possible actions. Thus,
$\phi(\mathbf{x}, \mathbf{z})$ is offset by an amount dependent on the action a.

⁶ Although non-linear models such as neural networks may be used, we have chosen
to restrict ourselves to a linear model to be able to properly compare
performance with that of other state-of-the-art linear sparse models.
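The three maps $\mu$, $\phi$ and $\Phi$ can be written out directly. The sketch below uses plain Python lists and 0-based action indices; the function names are illustrative, not the authors' implementation.

```python
def mu(x, z):
    """Restriction of x to the selected features: unselected entries become 0."""
    return [xi if zi == 1 else 0.0 for xi, zi in zip(x, z)]


def phi(x, z):
    """Intermediate representation: the mask z followed by mu(x, z)."""
    return list(z) + mu(x, z)


def block_features(x, z, action_index, n_actions):
    """Block-vector trick: place phi(x, z) in the block owned by the action."""
    block = phi(x, z)
    out = [0.0] * (n_actions * len(block))
    start = action_index * len(block)  # offset i_a * |phi(x, z)|
    out[start:start + len(block)] = block
    return out


x = [0.5, 0.0, 3.0]
z = [1, 1, 0]  # features 0 and 1 are known, feature 2 is not
print(phi(x, z))
print(block_features(x, z, action_index=1, n_actions=3))
```

Keeping z alongside $\mu(\mathbf{x}, \mathbf{z})$ is what lets a linear model tell the known zero at index 1 apart from the unknown value at index 2: both are 0 in $\mu$, but only the first has its mask bit set.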



3.3 Learning 

The goal of the learning phase is to find an optimal policy parameterization
$\theta^*$ which maximizes the expected reward, thus minimizing the datum-wise
regularized loss defined in (2). As explained in Section 3.2, we cannot
exhaustively explore the state space during training, and therefore we use a
Monte-Carlo approach to sample example states from the learning space. We use
the Approximate Policy Iteration (API) algorithm with rollouts [6]. Sampling
state-action pairs according to a previous policy $\pi_{\theta^{(t-1)}}$, API
consists in iteratively learning a better policy $\pi_{\theta^{(t)}}$ by way of
the Bellman equation. The API With Rollouts algorithm is composed of three main
steps that are iteratively repeated:

1. The algorithm begins by sampling a set of random states: the x vector is
   sampled from a uniform distribution in the training set, and z is also
   sampled using a uniform binomial distribution.

2. For each state in the sampled set, the policy $\pi_{\theta^{(t-1)}}$ is used
   to compute the expected reward of choosing each possible action from that
   state. We now have a feature vector $\Phi(\mathbf{x}, \mathbf{z}, a)$ for each
   state-action pair in the sampled set, and the corresponding expected reward
   denoted $R_{\theta^{(t-1)}}(\mathbf{x}, \mathbf{z}, a)$.

3. The parameters $\theta^{(t)}$ of the new policy are then computed using
   classical linear regression on the set of state-action feature vectors
   $\Phi(\mathbf{x}, \mathbf{z}, a)$ and the corresponding expected rewards
   $R_{\theta^{(t-1)}}(\mathbf{x}, \mathbf{z}, a)$ obtained previously. The
   generalizing capacity of the classifier gives an estimated score to
   state-action pairs even if we have never visited them.

After a certain number of iterations, the parameterized policy converges to a
final policy $\pi$ which is used for inference.
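Steps 1 and 2 of the loop above can be sketched as follows; the regression of step 3 is left out. This is a simplified illustration, not the authors' code: all names are hypothetical, the mask z is drawn with each bit set with probability 1/2 (one possible reading of the "uniform binomial" sampling), and `toy_policy` stands in for the previous policy $\pi_{\theta^{(t-1)}}$.

```python
import random


def sample_state(train_x, rng):
    """Step 1: draw x uniformly from the training inputs and a random mask z."""
    x = rng.choice(train_x)
    z = [rng.randint(0, 1) for _ in x]  # assumption: each bit is 1 with prob. 1/2
    return x, z


def rollout_return(x, y_true, z, action, policy, lam):
    """Step 2: total reward of taking `action` in (x, z), then following `policy`."""
    z = list(z)  # work on a copy so the sampled state is left untouched
    total = 0.0
    while True:
        kind, value = action
        if kind == "classify":
            return total + (0.0 if value == y_true else -1.0)
        z[value] = 1  # acquire the feature
        total -= lam  # and pay the feature selection reward -lambda
        action = policy(x, z)


def toy_policy(x, z):
    """Hypothetical previous policy: read feature 0, then classify by its sign."""
    if z[0] == 0:
        return ("feature", 0)
    return ("classify", 1 if x[0] > 0 else 0)


x, y = [1.5, -2.0], 1
# Regression target for the pair ((x, z = 0), "acquire feature 1").
target = rollout_return(x, y, [0, 0], ("feature", 1), toy_policy, lam=0.1)
print(target)
```

Because the process and the policy are deterministic, a single rollout per state-action pair gives the exact return; each such return would then be paired with $\Phi(\mathbf{x}, \mathbf{z}, a)$ as one regression example in step 3.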
4 Preventing Overfitting in the Sequential Model

In Section 3, we explain the process by which, at each step, we either choose a
new feature or classify the current datum. This process is at the core of DWSC
but can suffer from overfitting if the number of features is larger than the
number of training examples. In such a case, DWSC would tend to learn to select
the more specific features for each training example. In classical $L_1$
regularization models that are not datum-wise, the classifier must use the same
set of features for classifying any datum, and thus overly specific features are
not chosen because they usually appear in only a few training examples.

We propose a very simple variant of the general model that allows us to
avoid overfitting. We still allow DWSC to choose how many features to use
before classifying an input x, but we constrain it to choose the features in the
same order for all the inputs. For that, we constrain the score of the feature
selection actions to depend only on the vector z of the state (x, z). An example
of the effect of such a constraint is presented in Fig. 2. This constraint is
handled in the following manner:

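$$
\begin{cases}
s_\theta(\mathbf{x}, \mathbf{z}, a) = s_\theta(\mathbf{z}, a) & \text{if } a \in \mathcal{A}_f \\
s_\theta(\mathbf{x}, \mathbf{z}, a) = s_\theta(\mathbf{x}, \mathbf{z}, a) & \text{if } a \in \mathcal{A}_y
\end{cases}
$$

where $s_\theta(\mathbf{x}, \mathbf{z}, a) = s_\theta(\mathbf{z}, a)$ implies that
the score is computed using only the values of z and a; x is ignored. This
corresponds to having two different types of state-action feature vectors
$\Phi$ depending on the type of action:

$$
\begin{cases}
\text{if } a \in \mathcal{A}_f, & \Phi(\mathbf{x}, \mathbf{z}, a) = (0, \ldots, 0, \mathbf{z}, 0, \ldots, 0) \\
\text{if } a \in \mathcal{A}_y, & \Phi(\mathbf{x}, \mathbf{z}, a) = (0, \ldots, 0, \mathbf{z}, \mu(\mathbf{x}, \mathbf{z}), 0, \ldots, 0)
\end{cases}
$$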
Fig. 2. Difference between the base Unconstrained Model (DWSM-Un) and the
Constrained Model (DWSM-Con) described in Section 4. The figure shows, for 4
different inputs $\mathbf{x}_1, \ldots, \mathbf{x}_4$, the features selected by
the classifiers before classification. One can see that the Constrained Model
chooses the features in the same order for all the inputs.



Although this constraint forces DWSC to choose the features in the same
order, it will still automatically learn the best order in which to choose the
features, and when to stop adding features and classify. However, it will avoid
choosing very different feature sets for classifying different inputs (the first
features chosen will be common to all the inputs being classified) and thus
avoid the overfitting problem.
5 Complexity Analysis

Learning Complexity: As explained in Section 3.3, the learning method is based
on Reinforcement Learning with Rollouts. Such an approach is expensive in terms
of computation because, at each iteration of the algorithm, it needs to simulate
trajectories in the decision process, and then to learn the scoring function
$s_\theta$ based on these trajectories. Without giving the details of the
computation, the complexity of each iteration is $O(N_s \cdot (n^2 + c))$, where
$N_s$ is the number of states used for rollouts (which in practice is
proportional to the number of training examples), n is the number of features
and c is the number of possible categories. This implies a learning method which
is quadratic w.r.t. the number of features; the proposed approach is not able to
deal with problems with thousands of possible features. Reducing this complexity
is an active research direction with some leads.

Inference Complexity: Inference on an input x consists in sequentially
choosing features, and then classifying x. At step t, one has to perform
$(n - t) + c$ linear computations in order to choose the best action, where
$(n - t) + c$ is the number of possible actions when t features have already
been acquired. The inference complexity is thus $O(N_f \cdot (n + c))$, where
$N_f$ is the mean number of features chosen by the system before classifying. In
fact, due to the shape of the $\Phi$ function presented in Section 3.2 and the
linear nature of $s_\theta$, the score of the actions can be efficiently
incrementally computed at each step of the process by just adding the
contribution of the newly added feature. The complexity is thus reduced to
$O(n + c)$. Moreover, the constrained model, which results in ordering the
features, has a lower complexity of $O(c)$ because in that case, the model does
not have to choose between the different remaining features, and has only the
choice to classify or get the next feature w.r.t. the learned order.
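The incremental computation mentioned above follows from the linearity of the score. With $\phi(\mathbf{x}, \mathbf{z}) = (\mathbf{z}, \mu(\mathbf{x}, \mathbf{z}))$, the score of one action is a sum over selected features, so acquiring feature j only adds one term. The sketch below illustrates this for a single action; the weight names `w_z` and `w_mu` (the two halves of that action's block of $\theta$) are assumptions made for the example.

```python
def full_score(x, z, w_z, w_mu):
    """<Phi(x, z, a); theta> restricted to the block of one action a.

    w_z weights the mask part of phi, w_mu weights the mu(x, z) part.
    """
    return sum(zi * (wz + xi * wm) for xi, zi, wz, wm in zip(x, z, w_z, w_mu))


def add_feature(score, x, j, w_z, w_mu):
    """Update one action's score in O(1) after feature j has been acquired."""
    return score + w_z[j] + x[j] * w_mu[j]


x = [0.5, -1.0, 2.0]
w_z = [0.1, 0.2, 0.3]
w_mu = [1.0, -2.0, 0.5]

score = full_score(x, [0, 0, 0], w_z, w_mu)  # empty state
score = add_feature(score, x, 2, w_z, w_mu)  # acquire feature 2
score = add_feature(score, x, 0, w_z, w_mu)  # acquire feature 0
recomputed = full_score(x, [1, 0, 1], w_z, w_mu)
print(score, recomputed)
```

The incrementally maintained score matches the full recomputation, and each acquisition costs a constant amount of work per action rather than a fresh dot product over all features.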

Although the learning complexity of our model is higher than that of baseline
global linear methods, its inference speed is very close for the unconstrained
model, and equivalent for the constrained one. In practice, most of the baseline
methods choose a subset of variables in a couple of seconds to a couple of
minutes, whereas our method takes from a dozen minutes to an hour, depending on
the number of features and categories. Inference, on the other hand, is indeed
of the same speed in practice, which is in our opinion the important factor.

6 Experiments 

Experiments were run on 14 different datasets obtained from the LibSVM
Website⁷. Ten of these datasets correspond to a binary classification task, four
to a multi-class problem. The datasets are described in Table 1. For each
dataset, we randomly sampled different training sets by taking from 5% to 75% of
the examples as training examples, with the remaining examples being kept for
testing. We performed experiments with three different models: $L_1$-SVM was used as

⁷ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
Fig. 3. Accuracy w.r.t. sparsity. In both plots, the left side on the x-axis
corresponds to a low sparsity, while the right side corresponds to a high
sparsity. The performances of the models usually decrease when the sparsity
increases, except in case of overfitting.
Name              Features   Examples   Classes   Task
[missing]               14        690         2   Binary
Breast Cancer           10        683         2   Binary
Diabetes         [missing]        768         2   Binary
German Numer            24      1,000         2   Binary
Heart                   13        270         2   Binary
Ionosphere              34        351         2   Binary
Liver Disorders          6        345         2   Binary
Sonar                   60        208         2   Binary
Splice                  60      1,000         2   Binary
Svm Guide 3             21      1,284         2   Binary
Segment                 19      2,310         7   Multiclass
Vehicle                 18        846         4   Multiclass
Vowel                   10      1,000        11   Multiclass
Wine                    13        178         3   Multiclass

Table 1. Datasets used for the experiments.



a baseline linear model with $L_1$ regularization⁸. LARS was used to obtain the
optimal solution of the LASSO problem for all values of the regularization
coefficient $\lambda$ at once⁹. The Datum-Wise Sequential Model (DWSM) was tested
with the two versions presented above: (i) DWSM-Un is the original
unconstrained model and (ii) DWSM-Con is the constrained model for preventing
overfitting.

For the evaluation, we used a classical accuracy measure which corresponds
to 1 − error rate on the test set of each dataset. We performed 3
training/testing set splits of a given dataset to obtain averaged figures. The
sparsity has been measured as the proportion of features not used for $L_1$-SVM
and LARS in binary classification, and the mean proportion of features not used
to classify testing examples in DWSM. For multi-class problems where one
LARS/SVM model

⁸ Using LIBLINEAR [7].

⁹ We use the implementation from the authors of the LARS, available in R.
                                Sparsity = 0.8             Sparsity = 0.6             Sparsity = 0.4
Corpus           Train   DWSM-Un DWSM-Con   L1-SVM  DWSM-Un DWSM-Con   L1-SVM  DWSM-Un DWSM-Con   L1-SVM
german.numer     0.05      70.58    70.47    67.36    70.74    70.39    69.95    69.99    70.28    69.73
                 0.1       69.82    69.62    69.10    70.81    70.39    71.52    71.79     0.00    72.85
                 0.25      72.25    72.00    65.98    72.67    73.26    72.89    73.10     0.00    74.11
                 0.5       70.03    70.62    69.72    71.50    72.37    71.97    72.96    74.05    72.68
heart            0.05      48.33    48.17    45.42    51.17    50.67    65.73     0.00     0.00    68.24
                 [two intermediate rows not recoverable; one surviving value: 83.00]
                 0.5       70.34    68.87    69.20    77.15    80.34    78.83    80.48    80.40    80.58
ionosphere       0.05      69.52    71.36    73.55    73.44    73.02    72.23    74.77    75.16    72.59
                 0.1       71.58    71.09    71.84    75.12    74.63    75.89    74.97    74.93    74.49
                 0.25      79.65    80.29    75.94    85.18    85.44    81.58    85.58    85.69    82.78
                 0.5       77.31    78.40    71.15    82.94    82.68    78.18    84.96    84.16    79.64
liver-disorders  0.05      60.40    59.37    57.01    60.07    61.25    57.01    60.29    64.27    57.74
                 0.1       56.70    56.24    55.41    55.85    55.98    56.43    56.69    55.00    55.86
                 0.25      56.69    56.14    54.18    58.07    57.02    55.10    58.69    57.97    56.93
                 0.5       58.93    59.55    60.84    60.10    58.81    60.96    59.33    60.84    61.33
sonar            0.05      57.59    59.95    64.14    68.50    66.49    65.15    69.45    70.48    61.24
                 0.1       61.69    64.40    64.12    68.68    73.93    64.12    74.25    75.20    63.53
                 0.25      67.32    64.74    67.52    73.52    70.63    74.52    75.22    73.36    72.82
                 0.5       68.19    64.71    65.77    72.18    69.76    69.37    73.73    71.60    65.77
splice           0.05      67.23    68.41    67.82    70.14    68.66    65.93    70.51    69.89    64.47
                 0.1       66.90    66.87    61.46    70.35    67.99    62.63    71.05    70.07    61.62
                 0.25      73.87    73.89    70.49    74.81    75.30    72.28    75.60    76.64    71.74
                 0.5       72.86    76.79    72.78    74.98    77.88    70.36    77.09     0.00    69.35
svmguide3        0.05      77.15    77.13    77.17    77.32    77.25    78.25    77.48    77.37    78.21
                 0.1       77.31    77.28    76.59    77.94    78.11    78.95    78.58    78.94    78.37
                 0.25      76.67    76.56    75.96    77.44    77.14    77.40    78.21    77.72    77.91
                 0.5       77.71    77.78    76.87    78.55    78.63    78.15    79.38    79.47    78.37


Table 2. This table contains the accuracy of each model on the binary classification 
problems depending on three levels of sparsity (80%, 60%, and 40%) using different 
training sizes. The accuracy has been linearly interpolated from curves like the ones 
given in Figure 3. 



is learned for each category, the sparsity is the proportion of features that have 
not been used in any of the models. 

For the sequential experiments, the number of rollout states (step 1 of the
learning algorithm) has been set to 2,000 and the number of policy iterations
has been fixed to 10. Note that experiments with more rollout states and/or more
iterations give similar results. Experiments were made using an alpha mixture
policy with $\alpha = 0.9$ to ensure the stability of the learning process. We
tested the different models with different values of $\lambda$, which controls
the sparsity. Note that even with a $\lambda = 0$ value, contrary to the
baseline models, the DWSM model does not use all of the features for
classification.

6.1 Results 

For each corpus and each training size, we have computed sparsity/accuracy
curves showing the performance of the different models w.r.t. the sparsity of
the solution. Only two representative curves are given in Figure 3. To summarize
the performances over all the datasets, we give the accuracy of the different
models for three levels of sparsity in Tables 2 and 3. Due to a lack of space,
these tables do not present the performance of LARS, which is equivalent to
that of the $L_1$-SVM. Note that in order to obtain the accuracy
                                Sparsity = 0.8             Sparsity = 0.6             Sparsity = 0.4
Corpus           Train   DWSM-Un DWSM-Con   L1-SVM  DWSM-Un DWSM-Con   L1-SVM  DWSM-Un DWSM-Con   L1-SVM
segment          0.1       42.06    41.23    35.31    53.87    53.02    45.49   [lost]    56.57    56.98
                 0.2       40.76    40.17    40.48    55.70    56.34    45.97    57.42    59.10    53.24
                 0.5       43.29     0.00    37.17    64.09     0.00    45.15    56.43     0.00    50.52
                 0.75      43.78    41.13    38.22    55.10    53.60    44.80    56.54    56.99    47.00
vehicle          0.1       34.23    37.52    43.36    43.50    45.34    50.25    47.21     0.00    56.54
                 0.2       38.32    39.27    53.04    45.84    45.68    53.36    48.68    47.91    52.83
                 0.5       39.74    39.51    42.95    46.64    47.57    50.30     0.00    48.40    51.99
                 0.75      40.32    40.37    41.04    49.96    49.31    53.68    51.86    51.53    53.77
vowel            0.1       18.03    19.27     9.83    24.17    22.82    16.24    25.28    25.80    18.38
                 0.2        0.00    15.27    14.71   [lost]    20.17    15.93     0.00    22.59    15.93
                 0.5       18.98    17.81     9.57    24.56    25.33    17.73    28.45    27.31    23.76
                 0.75      19.85    19.49    14.41    28.01    31.45    24.58    32.09    32.74    26.69
wine             0.1       70.22    70.66    73.58    76.42    77.87    89.38    78.66    76.67    91.36
                 0.2       71.52    72.68    80.34    78.27    79.11    92.12    78.76    77.72    94.16
                 0.5       72.99    74.41    74.40    79.43    80.60    86.90    82.15    79.50    91.38
                 0.75      76.21    75.04    72.00    80.18    81.84    94.00    83.23    80.93    96.00


Table 3. This table contains the accuracy of each model on the multi-class classification 
problems depending on three levels of sparsity (80%, 60%, and 40%) using different 
training sizes. 



for a given level of sparsity, we have computed a linear interpolation on the
different curves obtained for each corpus and each training size. This linear
interpolation allows us to compare the baseline sparsity methods (which choose
a fixed number of features) with the average number of features chosen by
DWSC. This compares the average amount of information considered by each
classifier. We believe this approach still provides a good appreciation of the
algorithm's capacities.
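
The interpolation step can be sketched as follows. This is only an illustration under stated assumptions: the curve is taken to be a list of (sparsity, accuracy) pairs, the function name `accuracy_at_sparsity` is hypothetical, and returning `None` outside the measured range is a choice made for the example rather than something the experiments specify.

```python
def accuracy_at_sparsity(curve, target):
    """Linearly interpolate accuracy at a target sparsity level.

    curve: iterable of (sparsity, accuracy) pairs measured for one model,
    one corpus and one training size. Returns None when the target lies
    outside the range of measured sparsity values.
    """
    points = sorted(curve)
    # An exact hit needs no interpolation.
    for sparsity, accuracy in points:
        if sparsity == target:
            return accuracy
    # Otherwise interpolate between the two neighbouring measurements.
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        if s0 < target < s1:
            t = (target - s0) / (s1 - s0)
            return a0 + t * (a1 - a0)
    return None


# Toy curve (made-up numbers): accuracy read off at sparsity 0.8.
curve = [(0.2, 80.0), (0.5, 78.0), (0.9, 70.0)]
print(accuracy_at_sparsity(curve, 0.8))  # approximately 72.0
```

For the datum-wise models the sparsity coordinate of each point would be the average proportion of unused features over the test examples, which is what makes the comparison with a fixed-size baseline possible.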

Table 2 shows that, for a sparsity level of 80%, the DWSM-Un and the
DWSM-Con models outperform the baseline L1-SVM classifier. This is particularly
true for 7 of the 10 datasets, while the results are more ambiguous on the
three other datasets: breast, ionosphere and sonar. For a sparsity of 40%, similar
results are obtained. Depending on the corpus and the training size, different
configurations are observed. Some datasets can be easily classified using only a
few features, such as australian for example. In that case, our approach gives
similar results in comparison to L1 methods (see Figure 3-left). For some other
datasets, our method clearly outperforms baseline methods (Figure 3-right). On
the splice dataset, our model is better than the best (non-sparse) SVM while using
less than 20% of the features on average. This is due to the fact that our
sequential process, which solves a different classification problem, is more
appropriate for some particular datasets, particularly when the distribution of the
data is split up amongst distinct subspaces. In this case, our model is able to
choose more appropriate features for each input.

When using small training sets with some datasets (sonar or ionosphere)
where overfitting is observed (accuracy decreases with more features used),
the DWSM-Con seems to be a better choice than the unconstrained version and
thus is a version of the algorithm that is well-suited when the number of learning
examples is small.

Concerning the multi-class problems, similar effects can be observed (see
Table 3). The model seems particularly interesting when the number of categories
is high, as in segment and vowel. This is due to the fact that the average sparsity
is optimized by the sequential model for the multi-class problem while L1-SVM
and LARS, which need to learn one model for each category, perform separate
sparsity optimizations for each class.




Fig. 4. Breast-Cancer, training size = 10%, sparsity ≈ 50%. Left: The distribution
of use of each feature. For example, DWSM-Con uses feature 2 for classifying 100% of
the test examples, while DWSM-Un uses this feature for classifying only 88% of the
examples. Right: The mean proportion of features used for classifying (x-axis:
number of features acquired before classification). For example,
DWSM-Con classifies 42% of the examples using exactly 2 features while DWSM-Un
classifies 21% of the examples using exactly 2 features.

Figure 4 gives some qualitative results. First, from the left histogram, one 
can see that some features are used in 100% of the decisions. This illustrates 
the ability of the model to detect important features that must be used for 
decision. Note that many of these features are also used by the L1-SVM and the
LARS models. The sparsity gain in comparison to the baseline model is obtained 
through the features 1 and 9 that are only used in about 20% of decisions. From 
the right histogram, one can see that the DWSM model mainly classifies using 
1, 2, 3 or 10 features, showing that the model is able to adapt its behaviour to 
the difficulty of classifying a particular input. This is confirmed by the green 
and violet histograms that show that for incorrect decisions (i.e. very difficult 
inputs) the classifier almost always acquires all the features before classifying. 
These difficult inputs seem to have been identified, but the set of features is not 
sufficient for a good understanding. This behaviour opens appealing research 
directions concerning the acquisition and creation of new features (see Section 
8). 
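
The two histograms discussed above can be derived from the per-example sets of acquired features. The sketch below assumes such sets are available as Python sets of feature indices; the function name `usage_statistics` and the toy data are illustrative only and do not reproduce the values of Figure 4.

```python
from collections import Counter


def usage_statistics(selected):
    """Summarise datum-wise feature usage.

    selected: list with one set of acquired feature indices per test example.
    Returns (feature_freq, count_freq) where
      feature_freq[f] = fraction of examples that used feature f,
      count_freq[k]   = fraction of examples classified with exactly k features.
    """
    n = len(selected)
    if n == 0:
        raise ValueError("at least one example is required")
    feature_use = Counter()
    count_hist = Counter()
    for features in selected:
        features = set(features)
        feature_use.update(features)
        count_hist[len(features)] += 1
    feature_freq = {f: c / n for f, c in feature_use.items()}
    count_freq = {k: c / n for k, c in count_hist.items()}
    return feature_freq, count_freq


# Four toy test examples: feature 2 is always acquired, 1 and 9 rarely.
selected = [{2}, {2, 5}, {2, 5}, {1, 2, 5, 9}]
feature_freq, count_freq = usage_statistics(selected)
print(feature_freq[2], feature_freq[9])  # 1.0 0.25
print(count_freq[2])                     # 0.5
```

The first dictionary corresponds to the left histogram (how often each feature is used) and the second to the right histogram (how many features are acquired before classifying).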



7 Related Work 

Feature selection comes in three main flavors [8]: wrapper, filter, or embedded 
approaches. Wrapper approaches involve searching the feature space for an 
optimal subset of features that maximize classifier performance. The feature 
selection step wraps around the classifier, using the classifier as a black-box 
evaluator of the selected feature subset. Searching the entire feature space is 
very quickly intractable, and therefore various approaches have been proposed to
restrict the search (see [9,10]). The advantage of the wrapper approaches is that
the feature subset decision can take into consideration feature inter-dependencies
and avoid redundant features; however, the problem of the exponential
size of the search space remains. Filter approaches rank the features by some
scoring function independent of their effect on the associated classifier. Since the
choice of features is not influenced by classifier performance, filter approaches
rely purely on the adequacy of their scoring functions. Filtering methods are
prone to retaining redundant features and to missing feature inter-dependencies
(since each feature is scored individually). Filter approaches are,
however, easier to compute and more statistically stable relative to changes in
the dataset. Embedded approaches include feature selection as part of the
learning machine. These include algorithms solving the LASSO problem [1], and
other linear models involving a regularizer based on a sparsity-inducing norm
(Lp norms with p in [0, 1] [11], group LASSO, ...). Kernel machines provide a mixture of
feature selection and construction as part of the classification problem. Decision
trees are also considered embedded approaches although they are also similar to
filter approaches in their use of heuristic scores for tree construction. The main
critique of embedded approaches is two-fold: they are susceptible to including
redundant features, and not all the techniques described are easily applied to
multi-class problems. In brief, both filtering and embedded approaches have their
drawbacks in terms of their ability to select the best subset of features, whereas
wrapper methods have their main drawback in the intractability of searching the
entire feature space. Furthermore, all existing methods perform feature selection
based on the whole training set, the same set of features being used to represent
any data.
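
The contrast between filter and wrapper approaches can be made concrete with a toy example. The sketch below is an illustration under assumptions, not a method from the cited works: the filter scores each feature by its absolute Pearson correlation with the label, the wrapper is a greedy forward search (one of many ways to restrict the search), and the black-box evaluator is a nearest-centroid classifier scored on the training data. All function names are hypothetical.

```python
from statistics import mean


def filter_ranking(X, y):
    """Filter: rank features by |Pearson correlation| with the label."""
    def abs_corr(col):
        mx, my = mean(col), mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
        vx = sum((a - mx) ** 2 for a in col)
        vy = sum((b - my) ** 2 for b in y)
        return abs(cov) / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0
    n_features = len(X[0])
    scores = [abs_corr([row[j] for row in X]) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])


def centroid_accuracy(X, y, subset):
    """Black-box evaluator: training accuracy of a nearest-centroid rule."""
    if not subset:
        return max(y.count(c) for c in set(y)) / len(y)  # majority class
    classes = sorted(set(y))
    centroids = {c: [mean(r[j] for r, lab in zip(X, y) if lab == c)
                     for j in subset] for c in classes}
    correct = 0
    for row, lab in zip(X, y):
        pred = min(classes, key=lambda c: sum(
            (row[j] - m) ** 2 for j, m in zip(subset, centroids[c])))
        correct += pred == lab
    return correct / len(y)


def greedy_wrapper(X, y, evaluate, max_features):
    """Wrapper: forward selection driven by the black-box evaluator."""
    chosen, best = [], evaluate(X, y, [])
    while len(chosen) < max_features:
        candidates = [j for j in range(len(X[0])) if j not in chosen]
        if not candidates:
            break
        # Ties are broken in favour of the lowest feature index.
        score, j = max(((evaluate(X, y, chosen + [j]), j) for j in candidates),
                       key=lambda t: (t[0], -t[1]))
        if score <= best:
            break
        chosen.append(j)
        best = score
    return chosen


# Feature 1 duplicates feature 0; feature 2 is weaker alone but complementary.
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
f0 = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
f2 = [0, 0, 0, 0, -2, 0, 0, 0, 0, 2]
X = [[a, a, b] for a, b in zip(f0, f2)]
print(filter_ranking(X, y)[:2])                     # [0, 1]
print(greedy_wrapper(X, y, centroid_accuracy, 2))   # [0, 2]
```

On this toy data the filter keeps the two redundant copies because each feature is scored individually, while the wrapper discards the duplicate and picks the complementary feature, at the cost of re-evaluating the classifier for every candidate subset.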

Our sequential decision problem defines both feature selection and
classification tasks. In this sense, our approach resembles an embedded approach. In
practice, however, the final classifier for each single datapoint remains a separate
entity, a sort of black-box classifying machine upon which performance is
evaluated. Additionally, the learning algorithm is free to navigate over the
entire combinatorial feature space. In this sense our approach resembles a wrapper
method.

There has been some work using similar formalisms [12], but with different 
goals and lacking in experimental results. Sequential decision approaches have 
been used for cost-sensitive classification with similar models [13]. There have 
also been applications of Reinforcement Learning to optimize anytime
classification [14]. We have previously looked at using Reinforcement Learning for finding
a stopping point in feature quantity during text classification [15]. 

Finally, in some sense, DWSC has some similarity with decision trees as each 
new datapoint that is labeled is following a different path in the feature space. 
However, the underlying mechanism is quite different both in terms of inference
procedure and learning criterion. There has been some work in using RL for 
generating decision trees [16], but that approach is still tied to decision tree 
construction heuristics and the end product remains a decision tree. 

8 Conclusion

In this article we introduced the concept of datum-wise classification, where we
learn both a classifier and a sparse representation of the data that is adaptive
to each new datum being classified. We took an approach to sparsity that considers
the combinatorial space of features, and proposed a sequential algorithm
inspired by Reinforcement Learning to solve this problem. We showed that finding
an optimal policy for our Reinforcement Learning problem is equivalent to
minimizing the L0 regularized loss of our classification problem. Additionally, we
showed that our model works naturally on multi-class problems, and is easily
extended to avoid overfitting on datasets where the number of features is larger
than the number of examples. Experimental results on 14 datasets showed that
our approach is indeed able to increase sparsity while maintaining equivalent
classification accuracy.
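
For reference, a datum-wise regularized risk of the kind summarized above can be written as follows. The notation is an assumption made for this sketch and may differ from the symbols used in the earlier sections: N training pairs (x_i, y_i), a loss Δ, a prediction ŷ_θ(x_i), a binary vector z_θ(x_i) marking the features selected for x_i, and a trade-off parameter λ controlling sparsity.

```latex
\theta^{*} \;=\; \arg\min_{\theta}\;
  \frac{1}{N}\sum_{i=1}^{N} \Delta\bigl(\hat{y}_{\theta}(x_i),\, y_i\bigr)
  \;+\; \lambda\,\frac{1}{N}\sum_{i=1}^{N} \bigl\lVert z_{\theta}(x_i) \bigr\rVert_{0}
```

The penalty counts the features used for each datapoint separately and averages over the dataset, so two inputs may be classified with different feature subsets, in contrast with a single global sparse weight vector.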